Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add split one token into several (resolves #2838) #3253

Merged
merged 10 commits into from
Feb 14, 2019

Conversation

grivaz
Copy link
Contributor

@grivaz grivaz commented Feb 8, 2019

Adds the capacity to split tokens

Description

Implemented the feature as described in #2838. I added several error messages, and a new test suite for the functionality.

The tests passed with 554 warnings, but I think this is not related to this change.

Does this change require new documentation?

Types of change

New feature and new tests.

Checklist

  • I have submitted the spaCy Contributor Agreement.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

@ines ines added enhancement Feature requests and improvements feat / doc Feature: Doc, Span and Token objects labels Feb 8, 2019
@ines ines requested a review from honnibal February 10, 2019 17:43
spacy/errors.py Outdated Show resolved Hide resolved
spacy/tokens/_retokenize.pyx Outdated Show resolved Hide resolved
spacy/tokens/_retokenize.pyx Outdated Show resolved Hide resolved
@honnibal
Copy link
Member

Thanks a lot for this! It'll be a really useful change. The main question remaining is whether we want the API to take an index, or a Token object. I'm leaning towards Token object. Keen to get your thoughts.

@honnibal
Copy link
Member

I think there was a small problem before -- the token.idx attribute, which holds the character offset for the token, wasn't being updated. I made a change and tested for it, hopefully it goes green.

I also went ahead and made the change to the retokenizer.split() function, so that it takes a Token object instead of the index.

@honnibal honnibal merged commit 3981551 into explosion:master Feb 14, 2019
@grivaz
Copy link
Contributor Author

grivaz commented Mar 4, 2019

Sorry, I didn't see your input. Thank you for fixing it!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Feature requests and improvements feat / doc Feature: Doc, Span and Token objects
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants